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Open 



The current histoclinical breast cancer classification is 
simple but imprecise. Several molecular classifications of 
breast cancers based on expression profiling have been 
proposed as alternatives. However, their reliability and 
clinical utility have been repeatedly questioned, notably 
because most of them were derived from relatively small 
initial patient populations. We analyzed the transcrip- 
tomes of 537 breast tumors using three unsupervised 
classification methods. A core subset of 355 tumors was 
assigned to six clusters by all three methods. These six 
subgroups overlapped with previously defined molecular 
classes of breast cancer, but also showed important 
differences, notably the absence of an ERBB2 subgroup 
and the division of the large luminal ER + group into four 
subgroups, two of them being highly proliferative. Of the 
six subgroups, four were ER + /PR + /AR + , one was 
ER-/PR-/AR+ and one was triple negative (AR-/ 
ER— /PR— ). ERBB2-amplified tumors were split between 
the ER-/PR-/AR + subgroup and the highly prolifera- 
tive ER + LumC subgroup. Importantly, each of these six 
molecular subgroups showed specific copy-number altera- 
tions. Gene expression changes were correlated to specific 
signaling pathways. Each of these six subgroups showed 
very significant differences in tumor grade, metastatic 
sites, relapse-free survival or response to chemotherapy. 
All these findings were validated on large external 
datasets including more than 3000 tumors. Our data thus 
indicate that these six molecular subgroups represent well- 
defined clinico-biological entities of breast cancer. Their 
identification should facilitate the detection of novel 



Correspondence: Dr C Theillet, Institut de Recherche en Cancerologie 

de Montpellier, INSERM U896, CRLC Val d'Aurelle-Paul Lamarque, 

34298 Montpellier 5, France. 

E-mail: charles.theillet@inserm.fr 

15 These authors contributed equally to this work 

Received 17 June 2010; revised and accepted 14 June 2011; published 

online 25 July 2011 



prognostic factors or therapeutical targets in breast 
cancer. 
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Introduction 

Breast cancer is heterogeneous. Biological features 
have proven insufficient for a comprehensive description 
of the disease. Seminal work by Sorlie et al (2003) has 
delineated five major molecular subtypes of breast 
cancer associated to different outcomes. This initial 
classification was reproduced in independent datasets 
(Bertucci et al., 2006) strongly suggesting the existence 
of distinct molecular entities in breast cancer. The Sorlie 
centroid approach has subsequently been redefined and 
adapted to more recent technological platforms (Hu 
et al, 2006; Parker et al, 2009). 

However, criticisms have pointed to the instability of 
the defined subtypes (Kapp et al, 2006; Weigelt et al, 
2010) and their dependence on the original set of 
samples or genes. Thus, although molecular classifica- 
tion brings interesting insights in breast cancer taxon- 
omy, its implementation in the clinics is put in doubt 
because of insufficient reliability in single sample 
allocation (Weigelt et al, 2010). Rather, three broad 
classes of breast tumors drawn along their ER, PR and 
ERBB2/HER2 status are commonly used in the clinic. 
ER-/PR-/HER2- tumors were defined as triple nega- 
tive, ER + /PR + /HER2- as luminal, and HER2 + 
tumors irrespective of their ER status form the third 
class (Foulkes et al, 2010). However, this simple 
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classification is also criticized because of the biological 
heterogeneity within classes. In particular, the corres- 
pondence between the triple-negative group and basal- 
like breast tumors and the heterogeneity of the large 
ER + /PR + group have been repeatedly questioned 
(Gusterson, 2009; Foulkes et al., 2010). This argues for a 
more elaborate stratification amenable to biological 
exploration and clinical choices. 

This prompted us to construct a robust molecular 
classification on a large number of samples to reach 
high statistical power. To this aim, we produced trans- 
criptomes of a series of 537 primary breast cancers and, 
using a semi-supervised analysis, revealed six stable 
molecular subgroups. A related classification rule was 
defined. Each of the six molecular subgroups showed 
distinct genomic changes, correlated with a specific set 
of signaling pathways and was associated with signi- 
ficant differences in tumor grade, metastatic sites and 
metastasis-free survival (MFS). We propose that this 
classification scheme could lay the bases of an operative 
tool to reliably classify breast cancers in more homo- 
geneous molecular subgroups. This classification could 
be highly beneficial in future investigations aiming at 
identifying novel prognostic factors or therapeutical 
targets in breast cancer. 



Results 

Semi-supervised gene expression analysis identifies 
six prototypic molecular subtypes 

Our aim was to identify molecular subgroups represent- 
ing homogeneous subsets of breast cancer. Our meth- 
odology is detailed in Supplementary Figure 1 and the 
Supplementary Methods section. Briefly, we produced a 
large dataset comprising 537 primary breast cancer 
transcriptomes on Affymetrix U133-Plus2.0 arrays to 
ensure proper statistical power. First, this tumor set was 
classified with three unsupervised methods (hierarchical 
clustering, Gaussian mixture models and £-means) in 
parallel. Of the 537 tumors, 355 yielded a consensus 
subgroup assignment (that is, were assigned to the 
same subclass) between all three methods. This subset 
was named coreset and was used for further analysis. 
Second, a minimal list of 256 discriminative genes 
with maximal intragroup homogeneity and intergroup 
heterogeneity was generated by analysis of variance 
(Supplementary Table 3). Hierarchical clustering based 
on this list delineated six homogeneous tumor sub- 
groups, homogeneity being confirmed by the principal 
component analysis (Figure lb). To allow the classifica- 
tion of independent sample profiles to one of the six 
subgroups we built a single sample predictor based on 
a distance-to-centroid approach (using the previously 
mentioned 256 genes; Supplementary Methods). The 
182 tumors of the discovery set lying outside of the 
coreset were classified using this single sample predictor. 

The overall distribution of the six subgroups was 
determined by three large gene clusters shared by at 
least two subgroups. The first one (cluster- VI, Figures 
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la and c, Supplementary Table 3) containing ESR1 and 
correlated genes, defined two ER-negative (ER-) and 
four ER-positive (ER + ) subgroups (Figures la and c). 
The second gene cluster (cluster-IV) included the andro- 
gen receptor (AR) gene and encompassed five subgroups. 
Of the six subgroups, four were ER + /PR + /AR + , one 
was ER-/PR-/AR + and one was triple negative (AR-/ 
ER-/PR-; Figures la and c). Interestingly, cluster-IV 
included transcription factors FOXA1, SPDEF and 
XBP1, which are usually associated to the ER-cluster 
(Bertucci et al, 2006). The third cluster (cluster-II) was 
predominantly composed of genes regulating DNA repli- 
cation and cell cycle progression, thus defining elevated 
cell proliferation. This cluster encompassed both ER— 
and two ER+ subgroups (Figures la and c). 

Each subgroup was defined by a specific gene cluster 
(Supplementary Figure 2) in which we found genes 
previously part of the Sorlie centroids. Hence, for 
simplicity we named our subgroups according to the 
Sorlie subtype (Sorlie et al, 2003). ER+ subgroups 
were split according to expression levels of the cell 
cycle cluster. Low proliferative ER+ subgroups were 
differentiated by clusters-Ill and IX (Figure la, Supple- 
mentary Figure 2), comprising respectively genes from 
the Sorlie luminal-A and normal-like centroids (Supple- 
mentary Table 3) and were, thus, designated LumA and 
NormL. The two high proliferation ER+ subgroups 
differed sharply in ER-cluster expression levels. The 
subgroup expressing highest levels of ER was named 
LumB. The other subgroup, positioned at the boundary 
between ER + and ER- tumors, was designated LumC 
(Figures la-c). Noteworthy, 40% of LumC tumors 
overexpressed the ERBB2/HER2 gene. 

Next was the AR + /ER— /PR— subgroup (Figure lb), 
defined by cluster- VIII. The AR + /ER- status of this 
subgroup was reminiscent of the previously described 
'molecular-apocrine' subtype (Farmer et al, 2005) and 
we designated it mApo. Although ERBB2/HER2 was 
overexpressed by 72% of the tumors in this sub- 
group, cluster- VIII did not comprise genes co-amplified 
with ERBB2/HER2. In fact, ERBB2/HER2 + tumors 
distributed in mApo and LumC subgroups (Table 1). 
Finally, the AR-/ER-/PR- subgroup, defined by 
cluster-I, presented the greatest distance to all others 
(Figure 1). As it shared genes with the 'basal-like' 
subtype, it was designated BasL (Supplementary Table 3). 

Molecular subgroups show distinct clinical correlations, 
metastatic sites and outcomes 

BasL and mApo at one end of the spectrum, and LumA 
and NormL at the other end showed an inverse balance 
between high-grade and ER/PR positivity (Table 1). 
TP53 mutation incidence reached 83% in the BasL 
subgroup and gradually went down to 4% in NormL 
and LumA tumors (Table 1). This distribution of high- 
grade/ER— versus low-grade/ER + cancers was also 
coherent with the median age of onset: 50 and 62 for 
BasL and LumA patients, respectively. Correlation with 
histological type was observed as well. While the BasL 
subgroup was composed of 98% ductal carcinomas, 
NormL presented 19% of invasive lobular tumors, 
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Cluster ii (cell cycle) 




Figure 1 Breast tumor classification according to the CIT classification into six subgroups of tumors, (a) Heatmap representing the 
expression of the 256 genes (nine clusters of genes represented by vertical color bars on the left of the heatmap) through the six groups, 
(b) Principal component analysis (PCA) of the samples of the coreset according to the 256 gene signature. The first principal 
component (PCI) represents the combined expression of the three transversal clusters (ER, AR and cell cycle), the second component 
(PC2) differentiates LumB and NormL. (c) Distribution of mean expression levels of the three transversal gene clusters (ER, AR and 
Cell Cycle) over the six main molecular subgroups, (d) Comparison of the CIT classification with those obtained using the Sorlie, Hu, 
Parker and Jonsson systems. 



representing 53% of all lobular cancers in the dataset, in 
coherence with previous findings (Bertucci et al., 2008). 

Molecular subgroups showed differences in sites of 
metastatic relapse. In line with previous studies (Smid 
et al., 2008), LumA and NormL predominantly meta- 
stasized to the bone and rarely or never to the brain, 
while BasL and mApo tumors metastasized to the brain 
and less to the bones (Table 1). ST6GALNAC5, COX2/ 
PTGS2 and HBEGF, whose expression has recently 
been associated to brain metastasis (Bos et al, 2009), 
were increased in BasL (Supplementary Figure 3). Clear 
differences were also found in MFS (Figure 2). BasL 
and mApo subgroups showed earliest recurrence (18 to 
60 months). LumA and NormL had the slowest course. 
Metastatic recurrence plateaued between 60 and 180 



months in BasL and mApo, whereas it progressively 
increased after 60 months in ER+ subgroups. LumA 
and NormL tumors presented recurrences after 120 
months post-surgery. Interestingly, patterns of recur- 
rence (early versus late) matched cell cycle cluster 
expression levels in the different subgroups. 

Performance on external datasets 

We applied our classification scheme to a large 
Affymetrix dataset comprising 2291 breast cancer 
transcriptomes we have collected from the literature 
(Supplementary Methods). The six molecular subgroups 
were perfectly reproduced, both in terms of distribution 
and clinical correlations and outcomes (Supplementary 
Table 4a, Figure 2b). To further ascertain its robustness, 
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Table 1 Molecular subgroups show differential correlation to breast cancer clinico-biological parameters and different sites of metastatic relapse 



CIT classification 



Variable 


pv 


BasL 


mApo 


LumC 


LumB 


LumA 


NormL 


Total 




53 


39 


48 


66 


61 


88 


ER+ (me) 


l.OOE-50 


5 (10%) 


1 (3%) 


37 (84%) 


63 (98%) 


58 (97%) 


81 (93%) 


ER— (ihc) 




46 (90%) 


35 (97%) 


7 (16%) 


1 (2%) 


2 (3%) 


6 (7%) 


ER+ (exp) 


6.00E-68 


3 (6%) 


2 (5%) 


48 (100%) 


66 (100%) 


61 (100%) 


87 (99%) 


ER— (exp) 




50 (94%) 


37 (95%) 


0 (0%) 


0 (0%) 


0 (0%) 


1 (1%) 


PR+ (IHC) 


2.00E-25 


4 (8%) 


1 (3%) 


25 (54%) 


43 (67%) 


53 (88%) 


62 (71%) 


PR - (IHC) 




48 (92%) 


34 (97%) 


21 (46%) 


21 (33%) 


7 (12%) 


25 (29%) 


PR+ (EXP) 


l.OOE-37 


5 (9%) 


5 (13%) 


32 (67%) 


47 (71%) 


58 (95%) 


85 (97%) 


PR - (EXP) 




48 (91%) 


34 (87%) 


16 (33%) 


19 (29%) 


3 (5%) 


3 (3%) 


ERBB2+ ( IH q 


9.00E-19 


3 (7%) 


19 (68%) 


10 (26%) 


5 (11%) 


0 (0%) 


0 (0%) 


ERBB2— (me) 




43 (93%) 


9 (32%) 


28 (74%) 


41 (89%) 


37 (100%) 


74 (100%) 


ERBB2+ (exp) 


4.00E-31 


2 (4%) 


29 (74%) 


20 (42%) 


2 (3%) 


0 (0%) 


5 (6%) 


lLikddZ (EXP) 




51 (96%) 


10 (26%) 


TO /CO 0/ \ 

28 (58%) 


64 (97%) 


C 1 /1 AAO / \ 

61 (100%) 


o o /n ao/ \ 

83 (94%) 


AR + (EXP) 


2.00E-57 


2 (4%) 


32 (82%) 


47 (98%) 


63 (95%) 


61 (100%) 


88 (100%) 


AR— (exp) 




51 (96%) 


7 (18%) 


1 (2%) 


3 (5%) 


0 (0%) 


0 (0%) 


P53mut 


l.OOE-15 


29 (83%) 


13 (72%) 


24 (69%) 


5 (16%) 


1 (4%) 


1 (5%) 


P53wt 




6 (17%) 


5 (28%) 


11 (31%) 


27 (84%) 


27 (96%) 


21 (95%) 


Ductal 


0.05 


51 (98%) 


32 (84%) 


39 (87%) 


54 (84%) 


50 (83%) 


61 (77%) 


Lobular 


0.004 


1 (2%) 


1 (3%) 


3 (7%) 


3 (5%) 


5 (8%) 


15 (19%) 


Other 


0.1 


0 (0%) 


5 (13%) 


3 (7%) 


7 (11%) 


5 (8%) 


3 (4%) 


SBR Grade l 


8.00E-H 


0 (0%) 


0 (0%) 


0 (0%) 


0 (0%) 


7 (12%) 


23 (27%) 


SBR Grade 2 


2.00E-13 


6 (11%) 


8 (21%) 


21 (47%) 


38 (58%) 


44 (77%) 


53 (62%) 


SBR Grade 3 


4.00E-26 


47 (89%) 


30 (79%) 


24 (53%) 


28 (42%) 


6 (11%) 


9 (11%) 


Age (median) 


4.00E-07 


50 


56 


54 


57 


62 


52 


MR 5year 


0.001 


17 (36%) 


14 (38%) 


11 (34%) 


15 (26%) 


9 (20%) 


6 (8%) 


MR I5year 


0.01 


17 (36%) 


14 (38%) 


13 (41%) 


18 (32%) 


10 (22%) 


11 (15%) 


Bones 


0.01 


4 (24%) 


8 (57%) 


7 (54%) 


14 (78%) 


7 (70%) 


9 (82%) 


Brain 


0.06 


5 (29%) 


3 (21%) 


1 (8%) 


0 (0%) 


0 (0%) 


2 (18%) 


Liver 


0.7 


5 (29%) 


6 (43%) 


7 (54%) 


8 (44%) 


3 (30%) 


3 (27%) 


Lung 


0.9 


6 (35%) 


4 (29%) 


6 (46%) 


8 (44%) 


3 (30%) 


4 (36%) 


Other 


0.1 


4 (24%) 


1 (7%) 


7 (54%) 


8 (44%) 


3 (30%) 


3 (27%) 



Abbreviations: CIT, Cartes d'Identite des Tumeurs program; MR, metastasis relapse. 

Expression of ER, PR and ERBB2/HER2 were determined by immunohistochemistry as well as by RNA expression (for greater details see Supplementary 
Methods). TP53 mutation status was determined by the yeast functional assay (Supplementary Methods). P-values for qualitative variables (ER, PR, 
ERBB2/HER2, TP53 mutation, histological type, SBR grading) result from a Fisher exact test. P-values for quantitative variables (median age) result from 
an analysis of variance. MR was determined 5 and 15 years after surgery. Frequency of MR in a subgroup was calculated as the ratio of MR with the total 
number of MR. For each subgroup, percentages of MR in a given site are determined by the number of MR in this site over the whole number of MR in 
the subgroup. MR may occur at more than one site; hence, the sum of percentages may not equate 100. 



we tested our classification on three expression data- 
sets from different technological platforms (Swegene, 
Qiagen/Operon, Eurofins MWG Operon, Roissy, France; 
and Agilent, Santa Clara, CA, USA). Our prediction 
rule being designed for Affymetrix datasets we had to 
adapt it to different technological contexts (Supplemen- 
tary Methods). Overall molecular subgroups were repro- 
duced on different platforms (Supplementary Table 4b 
and Supplementary Figure 4). Differences were noted 
according to the dataset, which may possibly be due 
to different tumor recruitment in each series. To 
test inter-platform reproducibility, we classified the 
GSE3155 dataset that was analyzed in parallel on two 
dual-color (Agilent and Stanford University, Palo Alto, 
CA, USA) and one uni-color (Applied Biosystems, 
Carlsbad, CA, USA) platforms (Supplementary Table 
4c). Classification on both dual-color datasets snowed a 
90% overlap, suggesting a good inter-platform reprodu- 
cibility. However, overlap dropped dramatically when 
dual and uni-color platforms were compared (48 and 
52%). This indicates that classification rules need 



adaptation to technological specificities of each platform 
to perform optimally. 

Comparison with other molecular classifiers 
We next compared our classification with the Sorlie, Hu 
and Parker centroids (Sorlie et al, 2003; Hu et al, 2006; 
Parker et al, 2009). Variable overlaps were found for 
BasL, LumB, LumA and NormL subgroups (Figure Id). 
However, significant differences were noted for the mApo 
and LumC subgroups, which not only overlapped at 
variable levels with the ERBB2 subtype, but also with 
basal-like, luminal-A and -B, and normal-like groups, 
depending on the classifier (Supplementary Table 5). 
Classification differences affected the distribution of 
bioclinical markers among molecular subgroups. Main 
differences were in the fraction of ER + /PR + and AR + 
tumors in basal-like subtypes and the distribution of 
ERBB2 + tumors (Supplementary Table 6). MFS curves 
showed better separation of good and bad outcome 
subgroups with our classification (the CIT classification) 
(Supplementary Figures 5 and 6). 
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Figure 3 Molecular subgroups show differential activation 
of major signaling pathways: correlations between a given path- 
way and a subgroup are indicated by color boxes. Red boxes 
show upregulation of the pathway, green downregulation. Up or 
downregulation was deduced using KEGGanim tool where relative 
expression measures are projected in the related KEGG pathway 
interaction graph. Pathways showing no clear direction of 
regulation were excluded. 



Figure 2 Breast cancer molecular subgroups show distinctly 
different disease outcome. Kaplan-Meier curves shown in this 
figure represent disease-free survival with metastatic relapse as an 
end point, (a, b) show survival curves in the CIT and validation set, 
respectively. Abrupt breaks in some curves of (a) are related to 
small numbers of patients with long-term follow-up in these 
subgroups. These appear smoothed out in (b) because of greater 
numbers in the validation set. 



Molecular subgroups show differential activation 
of signaling pathways 

We selected 40 cancer relevant pathways from public 
databases and tested for specific enrichment in our 
molecular subgroups (Supplementary Methods). Genes 
specific for each subgroup were identified using four 
algorithms. Pathways were ordered for each subgroup 
on the mean rank of P- values across the four methods. 
As shown in Figure 3, each subgroup was associated to 
different up or downregulated signaling pathways. The 
upregulation of DNA replication and repair in BasL 
and LumB contrasted with its downregulation in 
NormL. The upregulation 4/5 immune system pathways 
in LumC was of further note. These data indicate that 
molecular subgroups relate to different signaling path- 
ways and biological processes. 

Molecular subgroups show specific genomic anomalies 
Of the 537 tumors profiled for RNA expression, 488 
tumors were analyzed by array- CGH (comparative 
genome hybridization). A total of 21 regions of gain 
and 33 regions of loss were found in more than 30% of 



the tumors (Figure 4a, top panel). BasL and LumB 
showed extensive copy-number alterations (CNAs), 
whereas NormL and LumA were the least rearranged. 
Qualitative differences were also apparent (Figure 4a) and 
we searched for CNAs specifically associated to each 
subgroup. BasL and LumB tumors presented the greatest 
number of CNAs with 39 and 46 specific CNAs, 
respectively (Figure 4a, Supplementary Table 7). The 
number of specific events was lower in the other 
subgroups ranging from 2 to 8. Expectedly, amplifications 
at 17ql2 were found in 70% of mApo tumors. LumA 
showed gains at 4q35 and 16pll-pl3, whereas NormL 
tumors could be differentiated from LumA by frequency 
of gains at 9q33, 8p23, 16pl3 and loss at 16ql2. 

CNAs were associated to large-scale gene expression 
modifications. A total of 786 genes comprised in 
intervals of gains or losses showed significantly modified 
expression levels. A number of regions of gains over- 
expressed genes encoding cell cycle and proliferation 
activators and, conversely, known tumor suppressors, 
pro-apoptotic or DNA damage checkpoint genes were 
found downregulated in regions of loss (Supplementary 
Table 7). These findings suggest that CNAs are part of 
a selective process associated with tumor progression, 
with differences from one subgroup to another. In that 
respect, 28 CNAs presented inverse patterns in different 
subgroups. These inverted patterns involved mainly BasL 
and LumB, but were also found in mApo and LumB or 
LumB and NormL (Supplementary Figure 7). Strikingly, 
they were associated to inverse expression of key cancer 



Oncogene 



Molecular classification of breast cancer 

M Guedj et a I 




Sub-group specific regions 



IT* 





L_ 


1 — i — 




i- 




it 




l r 




11 


•if'r 


— i — 


— i 




i 


i 


i 








r i 


i r 




r i IMF 
















































1 












i i 


i i i i i ii i i i i i i i i i 1 1 1 



104 - 

im - 



_ _ 



m so%- 



t — — i — ^ — n — ^ — ^ — n — i i — i i — i — t i ttttttt 























































































II 1111 


1 1 1 1 


i i r 




i i 




1 




..J 




9.1 ... 
Br ' 3* 



j:. 



I 



Figure 4 Breast cancer molecular subgroups present different copy-number change (CNC) profiles. CNC profiles were established 
using genome-wide array-CGH on a 488 breast tumor dataset and subsequently stratified according to the CIT classification. Panel a 
shows frequency of gains (vertical bars going up) or losses (bars going down) at a given location on the genome. Graphs from top to 
bottom correspond to profiles of the whole CIT breast cancer set and each of the six molecular subgroups. Panel b represents regions of 
CNC correlating to a specific subgroup. Specific genomic regions for the whole CIT set are the ones for which the proportion of 
alterations (in gain or loss) exceeded 20%. Subgroup-specific regions are those that present significant increase in proportion (at a 0.1 
FDR level) in a given subgroup tested against all others. Bars represent P-values after a standard logarithmic transformation. 
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genes. These data support the notion that breast cancer 
subgroups arise along distinct genetic pathways. 

Focal DNA amplification (defined as high-level gains 
occurring in regions not larger than 3 Mb) occurred 
significantly more frequently in LumB, mApo and 
LumC than in the other subgroups (Supplementary 
Table 8a). We further investigated the occurrence of 
focal CNAs and analyzed a subset of 72 tumors from the 
CIT discovery set with high-resolution Illumina 610K- 
SNP-arrays (Supplementary Table 8b). We detected 
246 gains and 337 losses (mean size 132 and 161 kb, 
respectively). We noted that 53% of the gains were also 
detected in our BAC-array data, while the overlap was 
lower for losses (19%). However, gains showed modest 
copy-number increase and were infrequently recurrent. 
Losses showed greater recurrence but this corresponded 
mainly to probable CNVs (identical starts and ends). 

We verified the overlap of our subgroups with the 
recently proposed CNA-based classification (Jonsson 
et aL, 2010) and observed an overall coherence with our 
findings. Their CNA-based Basal-complex class over- 
lapped with our BasL, 17ql2 with part of our mApo and 
LumC, Luminal complex and amplifier with LumB and 



LumC, while the Luminal- simple corresponded globally 
to LumA and NormL (Figure Id). 

Fraction of non- tumor cells and distribution in molecular 
subgroups 

The fraction of non-tumor cells is frequently discussed 
as a confounding factor in molecular analyses of breast 
cancer fostering the proposition that the normal-like 
group was a possible artifact (Prat et al, 2010). To get 
an objective estimate of the rate of non-diploid cells in 
our dataset and determine its distribution within mole- 
cular subgroups, we computed an estimate based on 
Illumina 610K-SNP data using a recent formula (Van 
Loo et al., 2010). Significant differences were seen 
among molecular subgroups (Supplementary Figure 8a) 
with, surprisingly, mApo showing the lowest rate of 
non-diploid cells. NormL ranked third and LumA and 
LumB presented the highest fraction of non-diploid 
cells. Our results agreed with recent data (Van Loo 
et al., 2010). However, a variable fraction of tumor cells 
may also be diploid, leading to an overestimation of 
normal cells. To assess this, a histological estimate of the 
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non- tumor cell fraction was performed on the tumors 
analyzed with the Illumina 610K-SNP-arrays. This 
showed that SNP-based estimates of non-diploid cells 
were lower than pathological tumor cell content (Sup- 
plementary Figure 8b). Overall these data are coherent 
with the idea of NormL representing a bonafide breast 
cancer subgroup. 

Breast cancer subgroups and mammary epithelial cell 
hierarchy 

To test whether our subgroups relate to distinct cells of 
origins in the mammary gland, we took advantage of 
three published expression profiling datasets of sorted 
normal mammary epithelial cell subpopulations (Raouf 
et al, 2008; Lim et al, 2009; Pece et al, 2010). We 
inferred a signature that discriminated the mammary 
stem cell (MaSC) enriched, luminal progenitor (LPC), 
mature luminal (MLC) and stromal cell populations, 
and used this signature to classify our breast tumor 
expression data (Supplementary Methods). As shown in 
Figure 5, the principal component analysis ordered 
normal mammary epithelial cell fractions according to 
a differentiation gradient and breast tumors from BasL, 
mApo, LumC, LumB/NormL to LumA, suggesting a 
proximity of BasL and mApo with either MaSC or LPC, 
whereas ER + subgroups showed a gradient between 
LPCs and MLCs. The correlation of BasL and mApo 
with least differentiated cells (MaSC or LPC) in the 
normal mammary gland was confirmed in a second 
analysis (Supplementary Table 9). 

Prognostic significance of molecular subgroups 
We next compared the prognostic significance in terms of 
metastatic relapse of our molecular subgroups to classical 
prognostic factors (ER, ERBB2/HER2, SBR grading and 
nodal involvement). As shown in Table 2, our classifica- 
tion signature performed better in both univariate and 
multivariate analyses than the classical prognostic factors, 
in both the discovery and validation sets. However, the 
absence of central pathology review in both datasets 
prevents us to draw firm conclusions on the independent 
prognostic power of our signature. In a comparative 
analysis with five expression signatures (Sorlie et al, 2003; 
van 't Veer et al, 2003; Hu et al, 2006; Sotiriou et al, 
2006; Parker et al, 2009), our signature came second 
after the van't Veer signature in the discovery set and 
performed best in the validation set (Supplementary 
Table 10), demonstrating the important difference in 
terms of prognosis among molecular subgroups. 



3 Breast cancer samples 
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Figure 5 Principal component analysis (PCA) of the CIT coreset 
expression profiles based on a meta-signature comparing normal 
mammary epithelial cell subpopulations. A 163 gene signature was 
produced by comparing different normal mammary cell contin- 
gents from three independent studies (GSE16997, GSE18931, 
GSE11395) and used in a PCA. Samples from the CIT coreset 
(panel a) and normal mammary gland samples (panel b) from 
GSE 16997 were projected in the two first principal components in 
the upper and lower panel, respectively. 



sufficient sample size by subgroup. Despite different 
chemotherapy protocols in individual cohorts, obvious 
differences in response were observed. BasL and mApo 
showed the best response rates with 50%, and 
37% of complete response, respectively. ER + sub- 
groups showed 15% of complete response in LumB/ 
LumC tumors and 0% in LumA/NormL (Table 3a). 
Prediction of complete pathological response of the CIT 
classification was then compared with that of ER status 
and SBR Grade in the three pooled datasets. Both in the 
univariate and multivariate analysis the CIT classifica- 
tion showed the strongest score (Table 3b). 



Discussion 



Molecular subgroups show differential response 
to chemotherapy 

To test whether our classification could predict chemo- 
therapy response, we analyzed three datasets of locally 
advanced breast cancers treated by neoadjuvant therapy 
followed by surgery and assessment of the pathological 
response. ER- breast cancers were overrepresented in 
the three cohorts, but our signature allowed the assign- 
ment of tumors to four subgroups after pooling LumB 
and LumC, as well as LumA and NormL to reach 



Breast cancer heterogeneity, reflected in molecular 
subgroups, can be attributed to differences in molecular 
alterations, cellular origin or both. We present a classi- 
fication of breast cancer into six molecular subgroups, 
which differed upon gene expression, genomic profiles, 
differentiation level and clinical features. 

First, gene expression differences strongly suggested 
that they outlined distinct biological entities, reflecting 
initiating mutations and/or cell-of-origin. Specific sets of 
signaling pathways were associated to each subgroup. 
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Table 2 Prognostic significance of the CIT classification 

Variable Univariate analysis Multivariate analysis 



Value HR 95% CI P-value P-value n HR 95% IC P-value P-value 
modality model modality model 
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Chemotherapy adjuvant 


Yes 


1.09 


0.73-1.62 


0.67 


0.67 
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(ref = No) 




















Hormononal adjuvant 


Yes 


0.64 


0.44-0.94 


0.02 


0.02 


375 








(ref = No) 




















b ) Molecular signatures 




















CIT (ref = NormL) 


LumA 


1.66 


0.84-3.30 


1.8 x lO" 5 




426 


2.0 


0.74-5.33 


3.7 x 10" 1 




Other 


3.16 


1.82-5.48 






426 


1.8 


0.63-5.06 




Sorlie (ref = NormL) 


LumA 


1.37 


0.74-2.52 


2.4 x 10" 3 




426 


1.9 


0.7-5.03 


5.0 x 10" 1 




Other 


2.29 


1.31-4.00 






426 


1.4 


0.57-3.29 




Hu (ref = LumA) 


NormL 


1.67 


0.86-3.25 


9.6 x lO" 5 




426 


2.7 


0.98-7.35 


4.2 x 10" 1 




Other 


2.88 


1.69-4.93 






426 


1.6 


0.72-3.73 




Parker (ref = LumA) 


NormL 


1.43 


0.74-2.75 


3.5 x lO" 3 




426 


1.25 


0.49-3.18 


3.5 x 10" 1 




Other 


2.26 


1.34-3.81 






426 


0.8 


0.36-1.79 




GGI (ref = Low risk) 


High risk 


2.51 


1.60-3.93 


3.4 x lO" 5 




426 


1.0 


0.43-2.47 


8.0 x 10" 1 


Van't Veer (ref = Low risk) 


High risk 


2.93 


2.00-4.27 


5.9 x lO" 9 




426 


2.8 


1.53-5.1 


3.4 x lO" 3 



Abbreviations: CI, confidence-interval; CIT, our classification; HR, hazard ratio. 

Relative risk was calculated taking metastatic relapse as an endpoint and compared with that of (a) clinical parameters and (b) of three molecular 
classifiers (Sorlie, Hu, Parker) and two prognostic signature (GGI, Van't Veer). The dataset comprised 426 patients from the CIT discovery set for 
which MFS information was available. Complete clinical information was available in 371 cases explaining the smaller numbers in the multivariate 
analysis on prognostic factors. Prognostic significance was assessed by applying a Cox model. Columns refer to the HR, the 95% CI and the 
P-values for both univariate and multivariate models. 



The distribution of the six subgroups was determined 
by the combination of the expression of three large gene 
clusters organized around the (i) estrogen receptor, 
(ii) androgen receptor and (iii) cell cycle regulator genes. 
The ER cluster is well known as defining luminal breast 
tumors (Bertucci et aL, 2006) and the expression of 
AR in breast cancer is long-known (Isola, 1993), but has 
been confounded with that of the ER cluster (Doane 
et aL, 2006). Its combined expression with the ER cluster 
yields three broad classes determined by nuclear receptor 
expression; AR— /ER— /PR— (triple negative) corres- 
ponding to the BasL subgroup, AR + /ER-/PR- 
(mApo), AR + /ER + /PR + (triple positive) including 
the four ER+ subgroups. The AR cluster comprises key 
genes previously associated to the ER cluster, such as the 
pioneer factor FOXA1, which recruits ER, AR and 
RAR/RXR (Carroll et ai, 2006; Lupien et ai, 2008). 

The existence of an ER-/AR + breast tumor subset 
(our mApo subgroup) has been proposed (Farmer et ai, 
2005; Doane et aL, 2006), and its important overlap 
with ERBB2/HER2 amplification is intriguing, possibly 
reflecting cross-talks between the AR and ERBB2/HER2 
pathways (Naderi & Hughes-Davies, 2008). How- 
ever, it is notable that our classification did not define 
an ERBB2 subgroup. Instead, ERBB2-amplified cancers 
distributed in mApo (ER— ) and LumC (ER + ) sub- 
groups. We found less expression differences between 
mApo/ERBB2 + and mApo/ERBB2- than between 



mApo and LumC tumors (Supplementary Figure 9). 
Interestingly, Staaf et ai (2010) showed that ER— and 
ER+ ERBB2-amplified tumors presented different 17q 
CNA patterns. These observations could have implica- 
tions in the clinic as they indicate that ERBB2 + breast 
cancer correspond to a biologically heterogeneous group. 
Moreover, it seems important to distinguish ERBB2 + 
and mApo tumors, because the so-called triple-negative 
group comprises both BasL and ERBB2-/mApo tumors 
despite clear molecular and clinical differences. 

Second, subgroups were also characterized by 
different patterns of genomic anomalies. These data 
were concordant with previous results (Chin et aL, 2006; 
Natrajan et aL, 2009) and the CGH classification 
recently proposed by Jonsson et aL (2010). Moreover, 
the existence of chromosomal regions showing inverse 
patterns (gain in one subgroup/loss in another) further 
supported the notion that these subgroups progress 
along distinct genetic routes, which possibly involve 
different mechanisms of genetic instability. 

Third, our data indicated that subgroups differed 
in their differentiation level, pointing to possible 
differences in cell-of-origin. This was suggested by 
similarities between the transcriptome of distinct cellular 
contingents in the normal mammary gland and those of 
molecular subgroups. While BasL and mApo showed 
proximity to MaSC or luminal progenitors, ER + 
subgroups formed a gradient between LPCs (LumC) 
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Table 3 Differential response to chemotherapy according to molecular subgroups of the CIT classification 





Treatment 


n 


Response 


BasL 


mApo 


LumC/LumB 


LumA/NormL 


P -value 


(a) Correlation 
Hess 


between the molecular subgroups of the CIT classification and pathological complete response to chemotherapy 

T/FAC 125 pCR 17(68%) 11(32%) 3(7%) 0(0%) 2.6 x 10" 9 
no pCR 8 (32%) 23 (68%) 41 (93%) 22 (100%) 


CIT 


EC 


58 


pCR 
no pCR 


8 (53%) 
7 (47%) 


6 (46%) 

7 (54%) 


2 (7%) 
25 (93%) 


0 (0%) 
3 (100%) 


1.6 x lO" 3 


Bonnefoi 


FEC 


66 


pCR 
no pCR 


16 (43%) 
21 (57%) 


7 (41%) 
10 (59%) 


5 (42%) 
7 (58%) 




NS 




TET 


58 


pCR 
no pCR 


17 (45%) 
21 (55%) 


6 (35%) 
11 (65%) 


3 (100%) 
0 (0%) 




0.11 


Total 




307 


pCR 
no pCR 


58 (50%) 
57 (50%) 


30 (37%) 
51 (63%) 


13 (15%) 
73 (85%) 


0 (0%) 
25 (100%) 


4.3 x lO" 10 








Univariate Analysis 






Multivariate Analysis 




Value 


n 


Odds ratio 


95% CI 


P -value 


n 


Odds ratio 


95% CI P-value 



(b) Uni- and multivariate analyses of factors predictive for pathological complete response to chemotherapy in the three pooled datasets 



ER 


ER- 


307 


4.5 


2.5- 


-8.4 


2.1 x 10" 08 291 


1.6 


0.67-4.2 


0.28 


Grade 


Grade3 


291 


3.2 


1.8 


-5.8 


3.6 x lO" 05 


1.9 


1-3.5 


0.04 


CIT molecular classification 


BasL/mApo 


307 


6.1 


3.1 


13 


7.0 x 10" 10 


3.8 


1.3-11 


0.01 



Abbreviations: CIT, our classification; NS, not significant; pCR, pathological complete response. 

Table 3a shows the correlation between pCR and CIT molecular subgroups. pCR and absence of response (no pCR) to chemotherapy were 
analyzed in three clinical trials (Hess et ah, 2006, Bonnefoi et ah, 2007, CIT set). Owing to the small number of data, four main subgroups and two 
intermediate subgroups were combined into two groups: (LumB; LumC; LumB/C) and (NormL; LumA; NormL/LumA). Treatment description: 
EC, six cycles of a dose-dense regimen of 75mg/m 2 epirubicin and 1200 mg/m 2 cyclophosphamide, given every 14 days; T/FAC, 24 weeks 
of sequential paclitaxel and fluorouracil-doxorubicin-cyclophosphamide; FEC, fluorouracil, epirubicin, and cyclophosphamide for six cycles; 
TET, docetaxel for three cycles followed by epirubicin plus docetaxel for three cycles. Correlations were calculated using Fisher exact test. Table 3b 
shows uni- and multivariate analyses of factors predictive of pCR in the three pooled datasets. Univariate analysis was done using the Fisher exact 
test and multivariate analysis by logistic regression. 



and MLC cells (LumA). Our findings are consistent with 
recent work suggesting that LPCs were the cells of origin 
of basal cancer and Brcal mammary tumors (Lim et ah, 
2009; Molyneux et ah, 2010). These findings bring 
insight on the prevalence of Grade 3 tumors in BasL and 
mApo contrasting sharply with that of low-grade 
cancers in NormL and LumA. Our data thus suggest 
that breast cancer may arise from at least two distinct 
cell types and that the final phenotype will result from 
genetic and epigenetic changes occurring during cancer 
progression. This may also have some link with the 
striking gradient of TP53 mutations observed between 
BasL and NormL subgroups. The correlation with 
elevated expression of the cell-cycle cluster and in- 
creased genomic instability was also notable. Moreover, 
there is a striking parallel between the incidence of TP53 
inactivation and the response rates of neoadjuvant 
chemotherapies. These data are in line with our previous 
observation proposing that TP53 is not the mediator of 
chemotherapy-induced cell death (Bertheau et ah, 2007). 

Fourth, molecular subgroups show striking differ- 
ences with respect to metastatic relapse both in terms of 
kinetics and site of recurrence. While BasL and mApo 
tumors preferentially metastasized to the brain and 
rarely to the bone, ER + subgroups exhibited an inverse 
pattern, strengthening previous studies (Smid et ah, 
2008). Our data suggest that these differences could be 
due to differential expression of key metastasis genes 



(Bos et ah, 2009). Hence, metastasis to a specific organ 
can also be the result of a subgroup-specific gene 
program and coexist with the de novo acquisition of 
stochastic mutations, as recently shown by massively 
parallel sequencing work (Ding et ah, 2010; Yachida 
et ah, 2010). Outcomes of the different subgroups were 
very different as well. BasL and mApo showed earlier 
relapse, but a remarkably stable MFS for the next 100 
months. In contrast, although all ER + subgroups did 
better during the first years, a continuous incidence of 
late relapse was observed. LumB and LumC outcome 
progressively became worse than that of BasL or mApo. 
However, a number of recurrences occurring after 
5 years in ER + subgroups are probably linked to 
interruption of anti-estrogen treatments. 

The status of the NormL subgroup is of particular 
interest because its existence has been put in doubt 
and attributed to an elevated content of normal cells 
(Prat et ah, 2010). In line with recently published data 
(Van Loo et ah, 2010), we showed that NormL tumors 
did not present a lower fraction of non-diploid cells 
than mApo or LumC. Furthermore, our data showed 
that 70% of NormL tumors showed loss at 16q, further 
supporting that this subgroup does not result from a 
co-cluterization of breast tumors presenting smaller 
fractions of tumor cells. 

Our results are in favor of the existence of different 
breast cancer subtypes bearing distinct biologies and 
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clinical courses. We propose that stratifying breast 
cancers according to such a classification could be highly 
beneficial when searching for new prognostic or response 
to treatment indicators. These would be subgroup specific 
instead of expressing the differences between highly and 
poorly proliferating tumors. Furthermore, such a classi- 
fication, once adapted in a format compatible with 
clinical setting, could efficiently contribute to disease 
management. Indeed, the different subgroups outlined 
here occur in different age groups, metastasize to different 
organs and exhibit distinct survival kinetics. Similarly, the 
association with immune system activation pathways in 
LumC may be indicative for an anti-tumor immunity in 
this specific subgroup. All of these are clear indications 
that they represent distinct clinical and biological entities. 



Materials and methods 

Patients and tumors 

A total of 724 primary breast carcinomas were collected and 
analyzed for expression profiling on Affymetrix U133-Plus2.0 
chips and a subset of 488 samples were analyzed by array-CGH. 
In addition, 58 fine-needle aspiration biopsies from patients 
undergoing neoadjuvant chemotherapy were analyzed by trans- 
criptome and included in the response-to-chemotherapies set. 
Full description can be found in Supplementary Table 1 and 
Table 2. Mean follow-up time was of 65 months. Four RNA 
from normal human breast tissue were used as reference. Histo- 
logical grade as well as ER, PR and HER2 levels determination 
are detailed in the Supplementary Methods. 

Discovery and validation sets 

Our 724 breast tumor transcriptome dataset was split in a CIT- 
discovery-set comprising 537 (75%) tumors of which 488 were 
analyzed by array-CGH and 187 (25%) cases were set apart 
for the validation-set. The Affymetrix validation set comprised 
the 187 samples from CIT and 2225 transcriptomes collected 
from GEO and array-express (Supplementary Table 2). 

Expression profiling and data analysis 

RNA profiling. Methods used for RNA purification, quality 
control, fluorescent probe production, hybridization and data 
processing were essentially as previously described (de Reynies 
et al., 2009). 

Transcriptome analysis and molecular subgroup determination 
Our rational was to ensure the greatest possible homogeneity to 
identified subgroups. Subgroup determination was based on the 
CIT-disco very- set including 537 transcriptomes and a clustering 
approach iterating unsupervised and supervised steps (Supple- 
mentary Figure 1, Supplementary Methods). Microarray data 
were first classified with a set of 244 most variant probe sets 
using in parallel hierarchical clustering, /c-means and Gaussian 
mixture model. Tumors that were assigned to the same group by 
the three methods were kept, defining a coreset of 355 tumors. 
Based on this coreset most discriminative genes were selected by 
analysis of variance and ranked by random-forest, producing a 
256 gene signature, leading to the identification of six homo- 
geneous molecular subgroups. Validation datasets were inde- 
pendently classified in the CIT molecular subgroups by applying 
a classical distance-to-centroid approach, implemented in the 
citbcmst R package available at the following URL http:// 
cran.r-project.org/web/packages/citbcmst/index.html and com- 
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ing with a (Sweave) user documentation. The complete classifi- 
cation procedure is detailed in the Supplementary Methods. 

Comparison with the Sorlie, Hu and Parker classifiers 
Sorlie (Sorlie et al, 2003), Hu (Hu et al, 2006) and Parker 
(Parker et al, 2009) centroids were retrieved from http:// 
genome-www.stanford.edu/breast_cancer/robustness/data/Intrinsic 
GeneList.txt, https://genome.unc.edu/pubsup/breastTumor/data/ 
306genes-X-249samples-X-5subtypes + 5centroids.xls and https:// 
genome.unc.edu/pubsup/breastGEO/pam50_centroids.txt, respec- 
tively. To build the classifiers corresponding clone UniGenelDs 
were mapped to Affymetrix (U133A or U133Plus2) probe sets. For 
Sorlie this was possible for 334 UniGene_IDs gene symbols, 
for Hu 232 UniGene_IDs and Parker all genes could be 
directly mapped. 

Comparison with the Jonsson array-CGH-based classification 
The 6 Jonsson centroids are relative to genomic regions 
determined with the GISTIC algorithm (Jonsson et al, 2010). 
Details are provided in the Supplementary Methods. 

Cancer pathways analysis 

Cancer relevant pathways were retrieved from KEGG (ftp:// 
ftp.genome.ad.jp/pub/kegg/pathways/hsa), Biocarta (http:// 
www.biocarta.com) and GO (http://www.geneontology.org/), 
and related genes were mapped to non-redundant HUGO Gene 
symbols. Four gene set analysis methods were used (Supple- 
mentary Methods), yielding P-values, which were transformed 
into ranks. Gene sets were ranked by order of interest according 
to the mean of the ranks across the four methods. 

Array-CGH 

Array-CGH was performed on a 4434 BAC-array with a 
median resolution of 0.6 Mb. DNA labeling, hybridization and 
data processing are as described in the Supplementary Methods. 

Statistical tests 

Clinical correlations were determined by x 2 f° r qualitative 
factors and analysis of variance for quantitative variables. 
Disease outcome was investigated with Kaplan-Meier curves 
using metastatic recurrence as an endpoint and subgroup for 
stratification. MFS was calculated from the date of diagnosis 
until first metastatic relapse. P-values at 60 and 180 months 
resulted from a log-rank test on Cox estimates. Benjamini and 
Hochberg method was applied for multiple-testing adjustment. 
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